{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# bRAG: Basic (naive) RAG Implementation\n",
    "\n",
    "This notebook demonstrates a complete implementation of a basic RAG system that enables question-answering over PDF documents. \n",
    "The implementation is model, database, and document loader agnostic, though it's currently configured with:\n",
    "- LLM: OpenAI GPT-3.5-turbo\n",
    "- Vector Database: Pinecone\n",
    "- Document Loader: PyPDFLoader\n",
    "\n",
    "The system combines several key components:\n",
    "1. Document Loading: Loads PDF documents (extensible to other document types)\n",
    "2. Text Processing: Splits documents into manageable chunks\n",
    "3. Vector Operations:\n",
    "   - Embeds text using OpenAI's embedding model\n",
    "   - Stores vectors in Pinecone vector database\n",
    "4. Retrieval System: Implements efficient document retrieval\n",
    "5. LLM Integration: Uses OpenAI's GPT model for generating responses\n",
    "\n",
    "All components can be swapped out for alternatives (e.g., different LLMs, vector stores, or document loaders) \n",
    "while maintaining the same overall architecture.\n",
    "\n",
    "This implementation serves as a foundation for building more complex RAG applications\n",
    "and can be customized based on specific use cases.\n",
    "\n",
    "----------------------------------------"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Pre-requisites (optional but recommended)\n",
    "\n",
    "### Only do the first step if you have never created a virtual environment for this repository. Otherwise, make sure that the Python Kernel that you selected is from your `venv/` folder."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create virtual environment\n",
    "! python -m venv venv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Activate virtual Python environment\n",
    "! source venv/bin/activate"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/Users/taha/Desktop/bRAGAI/code/gh/bRAG-langchain/venv/bin/python\n"
     ]
    }
   ],
   "source": [
    "# If your Python is not from your venv path, ensure that your IDE's kernel selection (on the top right corner) is set to the correct path \n",
    "# (your path output should contain \"...venv/bin/python\")\n",
    "\n",
    "! which python"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Install all packages\n",
    "! pip install -r requirements.txt --quiet"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Environment\n",
    "\n",
    "`(1) Packages`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "from dotenv import load_dotenv\n",
    "\n",
    "# Load all environment variables from .env file\n",
    "load_dotenv()\n",
    "\n",
    "# Access the environment variables\n",
    "langchain_tracing_v2 = os.getenv('LANGCHAIN_TRACING_V2')\n",
    "langchain_endpoint = os.getenv('LANGCHAIN_ENDPOINT')\n",
    "langchain_api_key = os.getenv('LANGCHAIN_API_KEY')\n",
    "\n",
    "## LLM\n",
    "openai_api_key = os.getenv('OPENAI_API_KEY')\n",
    "\n",
    "## Pinecone Vector Database\n",
    "pinecone_api_key = os.getenv('PINECONE_API_KEY')\n",
    "pinecone_api_host = os.getenv('PINECONE_API_HOST')\n",
    "index_name = os.getenv('PINECONE_INDEX_NAME')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`(2) LangSmith`\n",
    "\n",
    "https://docs.smith.langchain.com/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "os.environ['LANGCHAIN_TRACING_V2'] = langchain_tracing_v2\n",
    "os.environ['LANGCHAIN_ENDPOINT'] = langchain_endpoint\n",
    "os.environ['LANGCHAIN_API_KEY'] = langchain_api_key"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`(3) API Keys`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "os.environ['OPENAI_API_KEY'] = openai_api_key\n",
    "openai_model = \"gpt-3.5-turbo\"\n",
    "\n",
    "#Pinecone keys\n",
    "os.environ['PINECONE_API_KEY'] = pinecone_api_key\n",
    "os.environ['PINECONE_API_HOST'] = pinecone_api_host\n",
    "os.environ['PINECONE_INDEX_NAME'] = index_name"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`(4) Pinecone Init`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/taha/Desktop/bRAGAI/code/gh/bRAG-langchain/venv/lib/python3.11/site-packages/pinecone/data/index.py:1: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
      "  from tqdm.autonotebook import tqdm\n"
     ]
    }
   ],
   "source": [
    "from pinecone import Pinecone\n",
    "\n",
    "pc = Pinecone(api_key=os.environ['PINECONE_API_KEY'])\n",
    "index = pc.Index(os.environ['PINECONE_INDEX_NAME'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Full RAG App (Basic)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_community.document_loaders import PyPDFLoader\n",
    "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
    "from langchain.vectorstores import Pinecone\n",
    "from langchain_core.output_parsers import StrOutputParser\n",
    "from langchain_core.runnables import RunnablePassthrough\n",
    "from langchain_openai import ChatOpenAI, OpenAIEmbeddings\n",
    "from langchain.prompts import ChatPromptTemplate\n",
    "\n",
    "#### INDEXING ####\n",
    "\n",
    "pdf_file_path = \"test/langchain_turing.pdf\"\n",
    "loader = PyPDFLoader(pdf_file_path)\n",
    "\n",
    "docs = loader.load()\n",
    "\n",
    "# Split\n",
    "text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)\n",
    "splits = text_splitter.split_documents(docs)\n",
    "\n",
    "# Embed\n",
    "vectorstore = Pinecone.from_documents(\n",
    "    documents=splits, \n",
    "    embedding=OpenAIEmbeddings(model=\"text-embedding-3-large\"), \n",
    "    index_name=index_name\n",
    ")\n",
    "\n",
    "retriever = vectorstore.as_retriever()\n",
    "\n",
    "#### RETRIEVAL and GENERATION ####\n",
    "\n",
    "# Prompt\n",
    "template = \"\"\"Answer the question based only on the following context:\n",
    "{context}\n",
    "\n",
    "Question: {question}\n",
    "\"\"\"\n",
    "\n",
    "prompt = ChatPromptTemplate.from_template(template)\n",
    "\n",
    "# LLM\n",
    "llm = ChatOpenAI(model_name=openai_model, temperature=0.1)\n",
    "\n",
    "# Post-processing\n",
    "def format_docs(docs):\n",
    "    return \"\\n\\n\".join(doc.page_content for doc in docs)\n",
    "\n",
    "# Chain\n",
    "rag_chain = (\n",
    "    {\"context\": retriever | format_docs, \"question\": RunnablePassthrough()}\n",
    "    | prompt\n",
    "    | llm\n",
    "    | StrOutputParser()\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "('This document is about LangChain, a modular framework for Large Language '\n",
      " 'Models (LLMs), focusing on its architecture, core components, capabilities, '\n",
      " 'challenges, security implications, and applications in NLP. It provides '\n",
      " 'insights into how LangChain can be leveraged to build innovative and secure '\n",
      " 'LLM-powered applications tailored to specific needs.')\n"
     ]
    }
   ],
   "source": [
    "# Question\n",
    "from pprint import pprint\n",
    "\n",
    "pprint(rag_chain.invoke(\"What is this document about?\"))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
